-
Designing molecules that must satisfy multiple, often conflicting, objectives is a central challenge in molecular discovery. The enormous size of the chemical space and the cost of high-fidelity simulations have driven the development of machine learning-guided strategies for accelerating design with limited data. Among these, Bayesian optimization (BO) offers a principled framework for sample-efficient search, while generative models provide a mechanism to propose novel, diverse candidates beyond fixed libraries. However, existing methods that couple the two often rely on continuous latent spaces, which introduce both architectural entanglement and scalability challenges. This work introduces an alternative, modular “generate-then-optimize” framework for de novo multiobjective molecular design/discovery. At each iteration, a generative model is used to construct a large, diverse pool of candidate molecules, after which a novel acquisition function, qPMHI (multipoint Probability of Maximum Hypervolume Improvement), is used to optimally select a batch of candidates most likely to induce the largest Pareto front expansion. The key insight is that qPMHI decomposes additively, enabling exact, scalable batch selection via only a simple ranking of probabilities that can be easily estimated with Monte Carlo sampling. We benchmark the framework against state-of-the-art latent-space and discrete molecular optimization methods, demonstrating significant improvements across synthetic benchmarks and application-driven tasks. Specifically, in a case study related to sustainable energy storage, we show that our approach quickly uncovers novel, diverse, and high-performing organic (quinone-based) cathode materials for aqueous redox flow battery applications.
Free, publicly-accessible full text available December 21, 2026
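The batch-selection step described in this abstract (estimate, for each candidate, the probability that it yields the largest hypervolume improvement, then keep the top-ranked candidates) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes two objectives that are both maximized, an independent Gaussian posterior per candidate and objective in place of a full surrogate model, and hypothetical names (`hypervolume_2d`, `qpmhi_select`) and toy numbers chosen only for illustration.

```python
import random


def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded below by `ref` (maximization, 2 objectives)."""
    pts = sorted((p for p in points if p[0] > ref[0] and p[1] > ref[1]),
                 key=lambda p: -p[0])
    area, best_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        if f2 > best_f2:  # only points that extend the front add area
            area += (f1 - ref[0]) * (f2 - best_f2)
            best_f2 = f2
    return area


def qpmhi_select(front, candidates, ref, q, n_samples=2000, seed=0):
    """Rank candidates by a Monte Carlo estimate of P(candidate has the max HVI).

    `candidates` is a list of (mean, std) pairs, each a 2-tuple. Treating the
    objectives as independent Gaussians is an assumption of this sketch.
    """
    rng = random.Random(seed)
    base = hypervolume_2d(front, ref)
    wins = [0] * len(candidates)
    for _ in range(n_samples):
        best_i, best_hvi = None, 0.0
        for i, (mean, std) in enumerate(candidates):
            y = (rng.gauss(mean[0], std[0]), rng.gauss(mean[1], std[1]))
            hvi = hypervolume_2d(front + [y], ref) - base
            if hvi > best_hvi:
                best_i, best_hvi = i, hvi
        if best_i is not None:  # no winner if no sample improves the front
            wins[best_i] += 1
    probs = [w / n_samples for w in wins]
    ranked = sorted(range(len(candidates)), key=lambda i: -probs[i])
    return ranked[:q], probs


# Toy pool: two promising candidates and one that is almost surely dominated.
front = [(1.0, 1.0)]
pool = [((2.0, 2.0), (0.3, 0.3)),
        ((1.8, 1.8), (0.3, 0.3)),
        ((0.5, 0.5), (0.1, 0.1))]
batch, probs = qpmhi_select(front, pool, ref=(0.0, 0.0), q=2)
print(batch, probs)
```

Because each candidate's probability is estimated once and the batch is formed by sorting, selection cost grows with the pool size rather than with the number of possible batches, which is the scalability property the abstract attributes to the additive decomposition.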
-
Symbolic regression (SR) is an emerging branch of machine learning focused on discovering simple and interpretable mathematical expressions from data. Although a wide variety of SR methods have been developed, they often face challenges such as high computational cost, poor scalability with respect to the number of input dimensions, fragility to noise, and an inability to balance accuracy and complexity. This work introduces SyMANTIC, a novel SR algorithm that addresses these challenges. SyMANTIC efficiently identifies (potentially several) low-dimensional descriptors from a large set of candidates (from ∼10^5 to ∼10^10 or more) through a unique combination of mutual information-based feature selection, adaptive feature expansion, and recursively applied ℓ0-based sparse regression. In addition, it employs an information-theoretic measure to produce an approximate set of Pareto-optimal equations, each offering the best-found accuracy for a given complexity. Furthermore, our open-source implementation of SyMANTIC, built on the PyTorch ecosystem, facilitates easy installation and GPU acceleration. We demonstrate the effectiveness of SyMANTIC across a range of problems, including synthetic examples, scientific benchmarks, real-world material property predictions, and chaotic dynamical system identification from small datasets. Extensive comparisons show that SyMANTIC uncovers similar or more accurate models at a fraction of the cost of existing SR methods.
Free, publicly-accessible full text available February 12, 2026
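The idea of a Pareto-optimal set of equations, each giving the best-found accuracy at its complexity, can be sketched in a few lines. This is only an illustration of the accuracy-complexity trade-off, not SyMANTIC's code. The function name `pareto_front`, the integer complexity scores, and the error values are hypothetical; the abstract states that the actual method uses an information-theoretic complexity measure.

```python
def pareto_front(models):
    """Keep models not dominated in (complexity, error); lower is better for both.

    `models` is a list of (expression, complexity, error) tuples.
    """
    front = []
    best_error = float("inf")
    # Visit simplest models first; within equal complexity, most accurate first.
    for expr, complexity, error in sorted(models, key=lambda m: (m[1], m[2])):
        if error < best_error:  # must beat every simpler model
            front.append((expr, complexity, error))
            best_error = error
    return front


# Hypothetical candidate equations with made-up complexity and error values.
candidates = [
    ("x", 1, 0.9),
    ("x**2", 2, 0.5),
    ("sin(x)", 2, 0.7),
    ("x**2 + x", 4, 0.5),
    ("x**2 + sin(x)", 5, 0.1),
]
for expr, complexity, error in pareto_front(candidates):
    print(f"complexity={complexity}  error={error}  {expr}")
```

Here `sin(x)` is dropped because `x**2` is more accurate at the same complexity, and `x**2 + x` is dropped because it is more complex than `x**2` without being more accurate.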
-
Cell division cycle 5 (Cdc5) is a nucleic acid-binding protein that is highly conserved among eukaryotes and plays critical roles in development. Cdc5 can simultaneously bind DNA and RNA through its N-terminal DNA-binding domain (DBD), but the molecular mechanisms underlying its nucleic acid recognition, and how this binding regulates development, remain unclear. Herein, we present a crystal structure of the N-terminal DBD of MoCdc5 (MoCdc5-DBD) from the rice blast fungus Magnaporthe oryzae. Residue K100 of MoCdc5 is on the periphery of a positively charged groove that is formed by K42, K45, R47, and N92 and is evolutionarily conserved. Mutation of K100 significantly reduces the affinity of MoCdc5-DBD for a Cdc5-binding element but not for a conventional myeloblastosis (Myb) domain-binding element, suggesting that K100 is a key residue for high-affinity binding to the Cdc5-binding element. Another conserved residue (R31) is located close to the U6 RNA in the structure of the spliceosome, and its mutation dramatically reduces the binding capacity of MoCdc5-DBD for U6 RNA. Importantly, mutations in these key residues, including R31, K42, and K100 in AtCDC5, an Arabidopsis thaliana ortholog of MoCdc5, greatly impair the functions of AtCDC5, resulting in pleiotropic developmental defects and reduced levels of primary microRNA transcripts. Taken together, our findings suggest that Cdc5-DBD binds nucleic acids with two distinct binding surfaces, one for DNA and another for RNA, which together contribute to the mechanism by which Cdc5 regulates development through nucleic acid binding.